Linking Book Characters. Toward A Corpus Encoding Relations Between Entities

Dan Cristea
“Alexandru Ioan Cuza” University of Iași, Department of Computer Science (UAIC-FII)
Institute for Computer Science, Romanian Academy, Iași branch
dcristea@info.uaic.ro

Eugen Ignat
UAIC-FII
eugen.ignat@info.uaic.ro

Abstract – What does a novel bring to a reader? What can it bring to a machine? Are there chances that a machine will decipher the messages a book expresses in free language? Part of the content of a text is encoded in relations between entities. In order to decode them, algorithms make use of learning techniques in which the training is guided by corpora that make entities and relations explicit. The creation of a gold corpus to be used in training and evaluation is therefore of primary concern. This paper proposes annotation conventions and methodological prerequisites for the creation of a corpus that puts in evidence characters in a book and relations that are mentioned as holding between them, of the types: anaphoric, affective, kinship and social. The language under investigation is Romanian and the type of text used is fiction, but the proposed conventions are thought to be applicable to any language and type of text.

Keywords – entity linking; anaphoric relations; semantic relations; annotated corpora; annotation conventions; content analysis; text analytics; text understanding; XML

I. INTRODUCTION

Books are written for humans, not for machines. Like music, painting, theatre or dance, books are intended to impress us, but more than all these other forms of art, they can also bring us knowledge. We are able to learn from books about past events, behaviours of famous people, or places we have never been to. Non-trivially, books influence our personality, change our perspective on life, modify our way of thinking. To what extent can we exploit the content of books other than by keeping them under our eyes and reading them? To what extent could a machine touch the content hidden in books in order to present it to us in an organised manner? In what way would a technology work that is able to conceptualise parts of the knowledge encrypted in books? This paper is a preliminary study toward these goals.

In general, the entity linking task is defined as the process of linking named entities found in unstructured texts to records of a knowledge base. The task is thought to be important for several information extraction and natural language processing applications, such as search in unstructured texts and summarization.

Parts of this paper are based on a project proposal that presents a technology, called MappingBooks, intended to connect the text of a book, as well as its readers, with information extracted from the Web and from the reality around. A MappedBook is a book connected with locations/events in the virtual and real world and sensitive to the instantaneous location (as sensed by the mobile/tablet) of a reader. The information made available could be different depending on the moment and the place of the reader. For instance, if the user reads in a touristic guide about a museum and her/his momentary position happens to be in its proximity, the system will try to find out from the web the opening hours of the museum. Then, in case there is still time for a visit, it will signal this possibility to the user. This way, the text of the touristic guide comes to life and becomes intimately connected to the reader. Features of interactive social gaming could put in correspondence children immersed in a learning-by-visiting activity related to Geography, or tourists' journeys.

The MappingBooks technology puts in the same melting pot known and new techniques to create multi-dimensional mash-ups combining textual, geographical and even temporal information in order to present them adequately to a reader, intermediated by a specially-designed human-computer interface. The links should be sensitive to the context of the mentions in the book, the moment the user initiates an access and the current location of the user. The technology is intended to make heavy use of entity linking techniques, making it possible to spot the entities mentioned in the book (persons and locations) in the real and the virtual world.

Out of the MappingBooks proposal (that involves many other issues related to mixed reality and geo-informatics) we refer here only to the text analytics and content extraction machinery. We take the MappingBooks setting to exemplify possible applications of an entity linking technology.

In this paper we discuss a representational mechanism that makes possible the text analytics and entity linking techniques that should be put at the core of a technology such as the one described in MappingBooks. In particular, we concentrate on building a corpus annotated with information about a diversity of types of entity mentions and four types of relations (anaphoric, affective, kinship and social), out of which a system would learn to detect by itself the searched-for entity mentions and the relations inter-connecting them.

A number of features characterise our approach: 1) the mentions we want to connect are not necessarily entities with names, but general noun phrases (with pronominal, common or proper noun heads and modifiers); 2) there are no preliminary records about the entities linked, so the knowledge base evolves from scratch; 3) we try to connect in anaphoric coreferential chains all mentions of the same entity, each such chain being attached to exactly one entity in the universe of the text; 4) we are looking only for a well established set of relations these entities are involved in; 5) the texts we investigate are fiction books, therefore, on one hand, for each analysis, the reference area is well delimited and, on the other hand, the form of expression is totally unrestricted.

The paper includes the following sections: Section II gives a brief state of the art of the emerging entity linking domain, mentioning some of its best known methods and basic techniques. Section III presents a proposal for annotating a corpus of entities and relations, Section IV describes the corpus built with these conventions and Section V includes a discussion and some concluding remarks.

II. PRIOR ART AND BASIC LEVEL PROCESSING

In entity linking, the extracted data are stored in a knowledge base (KB), which is continuously evolving during the process. New extractions must be merged with already existent information, which imposes care in determining when a match occurs or when new entries should be created in the KB. Depending on the application, the source data is a closed collection (for instance a book) or is open – the whole web, or only parts of it (Wikipedia, DBpedia, touristic pages, mass-media reflected on the Web, social-media, blogs, etc.).

The most important challenges in entity linking address: name variations (different text strings in the source text refer to the same KB entity), ambiguities (there is more than one entity in the KB a string can refer to) and absence (there is no entity description in the KB to which a string in the source text representing an entity could possibly match). Bunescu and Pasca [1] and Cucerzan [2] presented important pioneering work in this area. Cucerzan does not handle absences while Bunescu and Pasca address them by learning a threshold. Some approaches use supervised machine learning. For instance, Rao et al. [3] score entities contained in the KB for a possible match to the query entity.

NELL (Never-Ending Language Learning) is a project¹ intending to make a computer learn to “read the web”. It browses hundreds of millions of web pages and applies several extraction methods to detect plausible facts involving entities and categories. These facts are then used to enhance its technique, such that, in time, it becomes more and more competent. The initial input of NELL has been an ontology including several hundreds of categories and binary relations. The “beliefs” NELL is continuously extracting from the web are used as a self-supervised collection of training examples. During its exploitation, the extracted knowledge had to be manually revised several times so as to eliminate erroneous facts, which, if left to proliferate, would have deteriorated its future behaviour.

¹ http://rtw.ml.cmu.edu/rtw/

The knowledge NELL extracts is of a static nature, since no temporal relations are taken into consideration. Other approaches address the challenge of determining time constraints as general sequencing knowledge, out of which to infer the moments or the periods of time when some particular events occurred.

The 2012 Text Analysis Conference (TAC) launched the Knowledge Base Population (KBP) Cold Start task, requiring systems to take a set of documents and produce a comprehensive set of triples that encode relationships involving named entities. TAC-2012 used a fixed schema of 42 relations and their logical inverses. One of the systems participating in this task reports between 30% and 80% accuracy, obtained by manually examining ten random samples for each relation, in the absence of a gold corpus. The system, called KELVIN [7], integrates a pipeline of processing tools, among which a basic tool is BBN's SERIF (Statistical Entity & Relation Information Finding) [8]. SERIF does named entity identification and classification by type and subtype, intra-document co-reference analysis, including named, nominal and pronominal mentions, sentence parsing, in order to extract intra-sentential relations between entities, and detection of certain types of events. SERIF provides a considerable suite of document annotations that have been used as basis for building an initial KB.

But an accurate entity linking technique is dependent on a diversity of NLP mechanisms, which should work well in correlation. An aphorism which circulates in NLP circles, referring only to a part of the domain, anaphora resolution, says that once you have done everything in NLP you have solved anaphora for free. Ad-hoc linking techniques are in the class of one-document and cross-document anaphora resolution [9, 10, 11, 12]. RARE, the UAIC-FII system of anaphora resolution, relying on a mixed approach which combines symbolic rules with learning techniques, has given good results for Bulgarian, German, Greek, English, Polish, and Romanian [13, 14, 15, 16].

Research on detection of content features at the pre-syntactic, syntactic and semantic levels, such as noun phrases, dependency relations and semantic roles, was the subject of many studies. Here we mention only a few, pursued within the NLP-Group@UAIC-FII, reporting technologies that correlate well with the issue discussed in this paper: noun phrase chunking² [17], dependency parsing [18], clause splitting [19, 16]. Textual entailment is a problem of central concern in mastering the diversity of expressions with similar content [20, 21] and techniques of semantic role labelling [22] are a source of good hints in searching verbal relations between entities. Also, an application of entity linking, the issue of extracting patterns of geographic knowledge in order to map a textual description of a travel onto GoogleMaps, was studied in two graduation theses [23, 24].

² Part of them are active web services at http://nlptools.info.uaic.ro/

As in many applications of NLP, entity linking too cannot be done without proper collections of annotated texts, where human linguistic expertise is used to make explicit the deep semantic knowledge of the kind the future technology is supposed to discover by itself. But the success of the attempt to acquire this kind of superior corpus relies very much on the maturity of basic NLP technologies (including at least: tokenization, part-of-speech tagging and lemmatization, noun phrase chunking, named entity recognition, and segmentation at sentence level), since basic elements are necessary for automatic processing and a massive manual annotation of them is not feasible, because of extremely high costs and time constraints. As such, there are two ways to go further in this enterprise: either a manually annotated corpus including basic elements already exists and it is further annotated by experts at higher levels, or a virgin text is chosen, on which basic linguistic processing is launched, and human expertise involving the superior levels is added only on top of the automatically marked elements. The second methodology has the advantage that the gold corpus and the test corpus (as well as the future texts processed by the technology) are characterised by similar annotation accuracies at basic levels (given by the basic NLP chains). If this decision for building the gold corpus is adopted, extracting content from free texts is usually the ultimate step of a long process addressing basic text analytics.

The techniques doing this processing seem to have reached technological maturity, as proved by recent European projects CLARIN³, METANET4U⁴, ATLAS⁵ (to mention only some in which the authors have been involved). For instance, Anechitei [19] reports results for the NLP chain⁶ in the range 80% – 97% for Romanian and English. Moreover, the same research shows how individual tools can be easily combined in processing chains by using UIMA-based interfaces [25]. A similar type of processing is used for the aggregation of language processing pipelines into NLP applications in METANET4U, where the U-Compare CAS interface, a dialect of UIMA, has been used.

³ http://www.clarin.eu/external/
⁴ http://metanet4u.eu/
⁵ http://www.atlasproject.eu/
⁶ http://www.atlasproject.eu/atlas/file/911996c9-0b0a-48ef-868a-7a483081e8f0/13ccaaa0-a0c0-11e0-8264-0800200c9a66/Finalreport.txt

III. ANNOTATION CONVENTIONS

In this paper we are concerned with proposing adequate annotation conventions that would support the development of a gold corpus. We discuss in this section two levels of annotation above the basic markings, which include token and noun phrase boundaries and word level morphological information. One level puts in evidence entities; the other marks relations between them.

A. Annotating entities

Syntactically, at the text level, the entities are signalled by noun phrases. Their annotation is important because they can participate in relations. To refer to them we will use a term which is common in anaphora studies: referential expressions (REs). Here are some rules for recognising REs:

- REs have nominal or pronominal heads, may include modifiers (determiners, adjectives, numerals, genitival constructions, prepositional phrases), but they do not extend also over relative clauses.

We will consider that each RE evokes an entity, which is part either of the same text as the RE itself, or of the virtual world (as a reflection of the real world). By borrowing a term from Centering, we will say that each RE realises an entity.

- If two nested noun phrases share the same syntactic head, only the largest is retained as a RE. If two REs intersect, they are necessarily nested. Nested REs should have distinct heads, which also means that they cannot realise identical entities (or, following the terminology from Section III.B, they cannot co-refer). For instance the lady with the red hat and the red hat realise distinct entities.

- In general, entities have types: PERSON, LOCATION, ORGANISATION, DATE, ADDRESS, OTHER, but in the exercise described in this paper, we are interested only in PERSON type entities (including groups).

- When entities have associated descriptions, it is important to distinguish between identification and characterisation descriptions. Only identification descriptions are included in REs. For instance, in acest bărbat, stricat până-n măduva oaselor⁷ the span stricat până-n măduva oaselor is a characterisation description. It does not help in the identification of the entity, of a man among many. On the other hand, in the man with straw hat the span straw hat is an identification description. It contributes to distinguishing one man in a group and it must be part of the RE.

- We say that an entity is included if it is not realised in the text. In Romanian, included entities appear only in the position of subject. We annotate such entities only if they participate in a relation that is among those decided to be annotated. For instance, in dar îl și iubeau din tot sufletul, the subject of iubeau is included. Here, the morpho-syntactic properties of the included entity are preserved in the person and number of the verb. We will annotate 1:[îl] and 2:[iubeau; REALISATION=INCLUDED] din tot sufletul, therefore the verb as the second entity, participating in an affective relationship (love, as described in Section III.C).

⁷ Most of the examples in this paper are taken from the Romanian version of Quo Vadis, by Henryk Sienkiewicz, București, 1967, in the translation from Polish by Remus Luca and Elena Lință. The English equivalents have been searched in the translation from Polish by Jeremiah Curtin (project GUTENBERG eBook Quo Vadis). All examples are marked in italics and the English equivalents follow the Romanian examples in angular brackets, like this: Romanian example <English equivalent>.

We adopted an XML notation for entities, which shows the span over which an entity extends and its type. As noted, an entity marking should extend over a noun phrase, which includes the component tokens (words). The XML element for a word is W and, apart from an ID, it includes morpho-syntactic attributes, among which LEMMA. The element denoting an entity is ENTITY, with attributes: ID and TYPE (and optionally, HEAD). As explained, for included subjects the verb is annotated instead, which adds to the ENTITY element also the attribute REALISATION=INCLUDED.

B. Annotating anaphoric relations

Anaphora is known to be a relation between two referential elements, in which one, the anaphor, depends on another referential element, usually called antecedent. If the element the anaphor depends on precedes the anaphor we call the relation anaphora, otherwise – cataphora. We use the term anaphoric relation to cover coreference (identity of reference), but also other types (part-of, class-of, etc.) – the complete list is given below in this section.

Following are a number of hints that guided the process of annotation of anaphoric relations:

• Anaphoric relations between non-imbricated REs (imbrication here means overlapping of REs) are directed in the text right-to-left, therefore from the later RE towards an earlier one. This corresponds to the usual interpretation done by humans during reading: the current referential expression is related to the already unfolded text, therefore to the left of the RE under scrutiny. The same direction is considered also in case of cataphora.

• The anaphoric relations between imbricated entities, by convention, are annotated from the largest RE towards the included one, for instance: 1:[femei din 2:[societatea înaltă]] => member-of.

The following types of anaphoric relations are annotated:

• coref (coreferentiality, symmetrical). Example: 1:[Acteea]… 2:[tânara libertă] => coref;

• member-of (from one element to the group it belongs to), for instance: … femei din societatea înaltă;

• has-as-member (the inverse of member-of: from a group to one element belonging to it), for instance: 1:[Petronius] 2:[Vinicius]… 3:[amândurora] => has-as-member; has-as-member; 1:[X], 2:[Y], 3:[Z] și 4:[alte 5:[femei din 6:[societatea înaltă]]] => has-as-member; … <… 2:[Ligia]… 3:[you PL]> => has-as-member;

• isa (from an instance to its conceptual class), for example: 1:[Profesor de matematică] s-a 2:[făcut; REALISATION=INCLUDED], așa cum a visat încă de copil => isa⁸;

• class-of (the inverse of isa, therefore from a concept to an instance): 1:[Ion] a devenit 2:[profesor de matematică] => class-of;

• part-of (a component part of an assembly), for instance: luându-1:[i] 2:[mâinile] => part-of⁹;

• has-as-part (the inverse of part-of: X has-as-part Y if X has Y as a component part), for instance: 1:[Ochii] 2:[îi] 3:[ai; REALISATION=INCLUDED] de la tata sau de la mama? => has-as-part (and coref). Note that there are two ways to read omul cu mâna în ghips, namely as 1:[omul cu 2:[mâna în ghips]] and as 3:[omul cu 4:[mâna] în ghips], and only in the second interpretation a has-as-part relation holds, because only a hand can be part of a man, not also a hand in plaster;

• subgroup-of (from a subgroup to a larger group, which includes it), for instance: Hristos 1:[i]-a iertat și pe 2:[evreii] care l-au dus la moarte și pe 3:[soldații romani] care l-au țintuit pe cruce <… to death, just as 3:[the Roman soldiers] who have nailed him against the cross> => subgroup-of; subgroup-of;

• has-as-subgroup (the inverse of subgroup-of), for instance: … ocazia zilei tatălui, 1:[băieții] s-au reîntâlnit cu 2:[familia] => has-as-subgroup;

• has-name (the relationship between an entity and its name), for instance: Atunci 1:[“sagatio”], cum numeau 2:[aruncarea în sus pe pelerina soldățească]… <… the 1:[“sagatio”], as they termed 2:[the tossing]> => has-name;

• name-of (the inverse of has-name, i.e. from a name to the entity bearing it), for instance: … numea 1:[aceste incursiuni] 2:[pescuit de perle] => name-of; 1:[numele lui 2:[Aulus]] => name-of.

⁸ The translation in English adopts an unnatural word order, but is left as such to preserve the isa relation, same as in Romanian.
⁹ Note that the two REs are non-imbricated in the original Romanian and imbricated in the English version. However, in both variants the sense of the relation conforms to the conventions stated above, in this section.

All types of anaphoric relations mentioned above are doubled by a similar number of cases in which the anaphoricity is dependent on the interpretation of a character, or is doubtful, or not yet realised. To put in evidence such person-mediated interpretations, the names of the relations are complemented with the ending “-interpret”. We discuss below only some of them:

• coref-interpret (coreference by virtue of the interpretation given by a character, following the vision of someone, also a symmetrical relation). Sometimes the coref-interpret relation holds between a predicative name and a subject: Nu avea nici cea mai mică îndoială că 1:[lucrătorul acela] e 2:[Ursus] => coref-interpret; L-am prezentat pe 1:[acest Glaucus] … => coref-interpret;

• class-of-interpret (a relation from a concept to an instance of it, in the vision of someone). Examples: dar nu ești 1:[tu] 2:[un zeu]? (but aren’t 1:[you] 2:[a God]?) => class-of-interpret; dați-mi- 1:[o] de 2:[nevastă] (give 1:[her] to me as 2:[wife]) => class-of-interpret.

The XML element for marking anaphoric relations is REFERENTIAL, with attributes: ID, FROM (indicating the ID of an ENTITY element), TO (also an ENTITY element) and TYPE (one of the types discussed above). The REFERENTIAL elements do not mark boundaries.

C. Annotating non-anaphoric relations

Before inventorying the types of non-anaphoric relations, we state below a number of principles we adhered to while marking them:

• The marking should indicate the minimal span that includes the two poles (arguments of the relation) and the trigger (a word making the relation explicit in the text). Usually the name of the relation should be identical with the trigger’s lemma, or should be a synonym or a hypernym of it. Among the two poles, one is the source (or actant) and the other – the destination (or passive), such that the relation should be read in the text between the source pole and the destination pole. This criterion indicates the sense of the relation and is supposed to disambiguate between a relation and its inverse.

• The poles are realised by intersecting or non-intersecting spans of text. Example of non-intersecting poles: 1:[Ion] 2:[o] 3:[iubește] pe 4:[Maria] => love, trigger: … (also coref), and of intersecting poles: 1:[… lui 3:[Ion]] => superior-of, trigger: …

• When one pole is missing, as for instance in case of an included subject, it is replaced in the annotation by those of the missing element (for instance, the main verb in case of a null subject). Example: dar 1:[îl] și 2:[iubeau; REALISATION=INCLUDED] din tot sufletul => love, trigger: …

The following types of non-anaphoric relations are annotated:

• Social relations: superior-of, inferior-of. Examples: in 1:[împăratul] și 2:[curtenii 4:[lui] de frunte] dormeau încă, the span of inferior-of is curtenii lui de frunte, because it includes all relevant elements, the two poles and the trigger: inferior-of, trigger: … Let’s note also that 4:[lui] is in a coref relation with 1:[împăratul], placed outside the span of the social relation.

• Affective relations: friend-of, enemy-of, love, hate and worship. Examples: 1:[… 3:[săi]] => friend-of, trigger: …; 1:[oamenii aceia] nu numai că-și 2:[slăveau] 3:[zeul] => worship, trigger: …; 1:[Ligia] îngenunche ca să se 2:[roage] 3:[altcuiva] => worship, trigger: …

• Kinship relations: parent-of (includes any relation between parents and children), child-of (inverse of parent-of), grandparent-of, grandchild-of (inverse of grandparent-of), sibling (symmetrical, between brothers and sisters), ant-uncle-of, nephew-of (inverse of ant-uncle-of), cousin-of (symmetrical, between cousins), spouse-of (symmetrical, between spouses), unknown (when the kinship type is not mentioned). We do not put in evidence the gender of the two poles, therefore no distinction is made between father, mother, son, daughter, etc., nor the number of actants and passives. Examples: 1:[… 3:[lor]] => parent-of, trigger: …; 1:[o 2:[rudă] de-a lui 3:[Petronius]] => unknown, trigger: …

The XML notations for non-anaphoric relations include the attributes: ID, FROM, TO, TRIGGER, TYPE. The markings should delimit the boundaries of the relation's span. Example:

1:[Marcus Vinicius] era 2:[fiul 4:[surorii 6:[sale] mai mari]], 7:[care], cu ani în urmă, se 8:[căsătorise] cu 9:[tatăl 11:[acestuia]], 12:[consul pe vremea lui 14:[Tiberiu]]

The entities are shown in square brackets and the following relations have to be marked:
- referential: coref; coref; coref; class-of;
- kinship: sibling, trigger: …; child-of, trigger: …; spouse-of, trigger: …; parent-of, trigger: …;
- social: inferior-of, trigger: …¹⁰

¹⁰ Note that in the English variant, the social relation is not as evident as in the Romanian variant.

Figure 1 shows the XML correspondent notations. The elements not marking boundaries are indicated here as stand-off.

[Fig. 1. Example of an annotation: XML markup of the example sentence above, with stand-off relation elements carrying attributes such as FROM="E12" TO="E8".]

IV. A CORPUS INCORPORATING ENTITY LINKS

A. Building the corpus

A group of first year master students in Computational Linguistics annotated the corpus in a collaborative work. Each of them received approximately 10 pages of text from the book. The XML conventions and the annotation rules have been established in the group along a number of weekly debates that lasted more than a couple of months, by discussing many examples. Before starting the manual annotation activity, the text of the whole book was passed through an initial NLP chain¹¹ which included a tokeniser, a POS-tagger and a lemmatiser. The students were instructed to mark the entities in a first step and the relations only in a second step. The annotation tool used was PALinkA¹².

¹¹ Web services of the NLP-Group@UAIC-FII, at http://nlptools.info.uaic.ro/
¹² http://clg.wlv.ac.uk/projects/PALinkA, author Constantin Orasan

A number of limitations have been set, with the intention to reduce the complexity of the task. Here they are:

• We did not annotate negated relations. For instance, no relation is marked in case the verb linking the subject to the predicative noun is negated: 1:[Ligia] nu poate să devină 2:[amanta nimănui] <… become 2:[the concubine of any man]> => the two REs are not linked by a (negated) coreferential relation; 1:[Vinicius]… Nu ești 2:[un oarecare] și nu ai 3:[un chip de rând] <… man] nor 3:[foolish]> => no relation is marked between any of these REs.

• Characterisations addressing subjects, expressed by predicative nouns, are not marked. Example: Și, ca un cunoscător, înțelese că 1:[este; REALISATION=INCLUDED] 2:[o ființă deosebită] <… understood that in 1:[her] there was 2:[something uncommon]> => no relation is marked between the two REs.

• In this corpus, no interpreted relations, besides the coreferential ones, are marked yet. For instance, in the following span, the name-of-interpret relation is not marked between the two REs: Petronius… care simțea că pe statuia 1:[acestei fete] trebuia scris 2:[“Primăvara”].

• Care was taken where lexical items used as triggers had different senses than those implied by the relation. For instance, in the context La botezul sfânt mi s-a dat numele de Urban, părinte, the word părinte is not a trigger for a parent-of type of kinship relation, because its sense here is priest.

• In case of coreferentiality (the anaphoric relation coref), the annotators were instructed to mark no more than N-1 relations in a coreferential cluster of N REs, therefore the minimum number of relations which are necessary to recover the complete cluster by symmetry and transitivity. This allows for exactly one initial member (first mention) and exactly one coreference link for each succeeding mention of the same entity. Also, any member of the cluster can be chosen as antecedent of any of the but-first members of the cluster¹³. In the following example: Se repezi la 1:[Petru] și, luându-2:[i] 3:[mâinile], începu să 4:[i] 5:[le] sărute¹⁴ => coref, part-of (or part-of), coref (or coref), coref. As such, marking part-of is superfluous, because it results, by transitivity, from coref and part-of. In the following span 1:[Numele de familie al lui 2:[Ion]] este 3:[Rădulescu] the anaphoric relations can be annotated in two ways: name-of (and not has-name, because the convention in case of anaphoric relations between nested REs, as stated in Section III.B, is from the outer RE towards the inner RE), plus coref. An equivalent notation is name-of, coref.

¹³ This is true even in cases when the chain includes a proper noun followed by a number of pronouns and the coreferentiality is decided between some pairs of pronouns, provided (at least) one of them is also found coreferential with the initial proper noun. Let’s note that it is important to know that a series of pronouns refer to the same entity. The proper identity of these mentions, once decided for one of them, is transferred by symmetry and transitivity to all members of the cluster.
¹⁴ In the English equivalent, two mentions of Peter are missing.

In its present shape the corpus covers 1,500 sentences and includes: 3663 referential expressions, 2045 anaphoric links, 39 affective relations, 21 kinship relations and 15 social relations. In the training sessions, the students were instructed to give more importance to the correctness of annotations than to their completeness. Moreover, the annotations received a second look through by the second author. In this initial phase of the research, we will mainly observe the precision of an automatic parser rather than its recall.

V. CONCLUSIONS

Deep understanding of texts is a challenge that is still far from being accomplished. Meanwhile, emergent domains approach sub-tasks of this giant enterprise, among them: word sense disambiguation, syntactic parsing, discourse parsing, metaphor interpretation, textual entailment, semantic role labelling. The task is so difficult because it necessitates large corpora annotated with expert knowledge, at different layers of interpretation of language. Entity linking, or the task of recognising relations linking mentions of entities in texts, is perhaps, by its ambitions, the closest to this distant goal. A text contains static descriptions about entities, events relating them, general statements about the world or about a small part of it. The segmentation that is natural in language, as given by syntax (words, clauses, sentences), offers the ingredients of the language structure, but no one knows yet in what terms a representation should be stated. In principle, such a representation should be stable to variations which do not confuse the content and to any source language.

The research described in this paper relates to the aspect of inventing a representation for deep text understanding. Criteria to annotate referential expressions that realise entities have been established. Then we focussed on four types of relations between entities of type person and proposed conventions for annotating them in XML.

The research showed that the human annotators responded well to our training in the attempt to annotate unambiguously relations between entities. In order to ease the annotation process and to reduce to a minimum the possibilities of variation due to personal interpretations, a number of conventions were established. Among such constraints were, for instance, those intended to detect the exact span of referential expressions, the borders delimiting relations, the criteria to decide between a relation and its inverse in case of non-symmetrical relations, when and how to mark a null pronoun, etc.

Further research will concentrate on two directions. First, the corpus must be enlarged, its accuracy augmented and its completeness scrutinised. One activity we have to develop in the near future will be to monitor parallel annotations done by different subjects and measure the inter-annotator agreement. When accomplished, this direction will produce a gold corpus that will sustain the second line of research: the development of entity linking algorithms trained and, afterwards, validated on the corpus.

ACKNOWLEDGMENT

We are grateful to the class of first year master students in Computational Linguistics, from the Faculty of Computer Science, “Alexandru Ioan Cuza” University of Iași, who, during the second term of the university year 2012-2013, realised the “Quo Vadis” corpus described in this paper.

REFERENCES

[1] R. C. Bunescu and M. Pasca, “Using encyclopedic knowledge for named entity disambiguation”. European Chapter of the Association for Computational Linguistics (EACL), 2006.
[2] S. Cucerzan, “Large-scale named entity disambiguation based on wikipedia data”. Empirical Methods in Natural Language Processing (EMNLP), 2007.
[3] D. Rao, P. McNamee, and M. Dredze, “Entity Linking: Finding Extracted Entities in a Knowledge Base”, in Thierry Poibeau, Horacio Saggion, Jakub Piskorski and Roman Yangarber (eds.), Multisource, Multilingual Information Extraction and Summarization, Springer Lecture Notes in Computer Science, Berlin, Heidelberg, 2012.
[4] A. Carlson, J. Betteridge, E. R. Hruschka Jr. and T. M. Mitchell, “Coupling Semi-Supervised Learning of Categories and Relations”. In Proceedings of the NAACL HLT Workshop on Semi-supervised Learning for Natural Language Processing, 2009.
[5] T. Mohamed, E. R. Hruschka Jr. and T. M. Mitchell, “Discovering Relations between Noun Categories”. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2011.
[6] P. P. Talukdar, D. T. Wijaya and T. M. Mitchell, “Acquiring Temporal Constraints between Relations”. In Proceedings of the Conference on Information and Knowledge Management (CIKM), 2012.
[7] P. McNamee, J. Mayfield, T. Finin, T. Oates, D. Lawrie, T. Xu, and D. W. Oard, “KELVIN: a tool for automated knowledge base construction”. In Proceedings of the NAACL HLT 2013 Demonstration Session, pp. 32-35, Atlanta, 10-12 June 2013.
[8] E. Boschee, R. Weischedel, and A. Zamanian, “Automatic information extraction”. In Proceedings of the 2005 International Conference on Intelligence Analysis, McLean, VA, pp. 2–4, 2005.
[9] A. Bagga and B. Baldwin, “Entity-based cross-document coreferencing using the Vector Space Model”, in COLING '98 Proceedings, vol. 1, 1998.
[10] A. Bagga and B. Baldwin, “Cross-document event coreference: annotations, experiments, and observations”, CorefApp '99 Proceedings, 1999.
[11] H. Saggion, “SHEF: semantic tagging and summarization techniques applied to cross-document coreference”, SemEval '07 Proceedings, 2007.
[12] S. Singh, A. Subramanya, F. Pereira and A. McCallum, “Large-scale cross-document coreference using distributed …
[15] D. Cristea, O. Postolache, “How to deal with wicked anaphora”. In Antonio Branco, Tony McEnery, Ruslan Mitkov (eds.): Anaphora Processing: Linguistic, Cognitive and Computational Modelling, Benjamin Publishing Books, 2005.
[16] D. Anechitei, D. Cristea, I. Dimosthenis, E. Ignat, D. Karagiozov, S. Koeva, M. Kopeć, C. Vertan, “Summarizing Short Texts Through a Discourse-Centered Approach in a Multilingual Context”. In Neustein, A., Markowitz, J. A. (eds.), Where Humans Meet Machines: Innovative Solutions to Knotty Natural Language Problems. Springer Verlag, Heidelberg/New York, 2013.
[17] R. Simionescu, “Romanian Deep Noun Phrase Chunking Using Graphical Grammar Studio”. In Proceedings of the ConsILR conference 2011-2012.
[18] M. Colhon, “Syntactic Translation Patterns from a Parallel Treebank”, Proceedings of the First Workshop on Computational Linguistics of Balkan Languages (CLoBL 2012), The 5th Balkan Conference in Informatics – BCI 2012, Novi Sad, Serbia, September 16-20, ISBN 978-86-7031-200-5, pp. 85-88, 2012.
[19] D. Anechitei, “Multilingual Discourse Processing”. Dissertation thesis, “Alexandru Ioan Cuza” University of Iași, Department of Computer Science, 2012.
[20] A. Iftene, “Textual Entailment”, PhD thesis, “Alexandru Ioan Cuza” University of Iași, Department of Computer Science, Iași, 2009.
[21] A. Moruz, “Predication Driven Textual Entailment”, PhD thesis, “Alexandru Ioan Cuza” University of Iași, Department of Computer Science, Iași, 2011.
[22] D. Trandabăț, “Natural Language Processing Using Semantic Roles”, PhD thesis, “Alexandru Ioan Cuza” University of Iași, Department of Computer Science, Iași, 2010.
[23] G. Cărăușu, “Processing Spatial Relations In Old Texts And Their Transposition On Modern Maps” (in Romanian: “Prelucrarea expresiilor spațiale în textele vechi pentru realizarea echivalărilor topografice în hărțile actuale”), graduation thesis, Faculty of Computer Science, “Alexandru Ioan Cuza” University of Iași, 2011.
[24] A. M. Ciucanu, “Iter in Chinam – Reconstructing the Journey of Milescu Spătarul from Russia to China” (in Romanian: “Iter in Chinam – Reconstituirea traseului lui Milescu Spătarul din Rusia până în China”), graduation thesis, Faculty of Computer Science, “Alexandru Ioan Cuza” University of Iași, 2011.
[25] D. Ferruci, A. Lally, “UIMA: an architectural approach to unstructured …
inference and hierarchical information processing in the corporate research environment”, Natural models", HLT '11 Proceedings, vol 1, 2011 Language Engineering 10, No 3-4, pp 327-348, 2004 D Cristea and G E Dima, “An integrating framework for anaphora K Yoshinobu, W A Baumgartner Jr , L McCrohon, S Ananiadou, K B resolution” Information Science and Technology, Romanian Academy Cohen, L Hunter and J Tsujii, “U-Compare: share and compare text Publishing House, Bucharest, vol 4, no 3-4, p 273-291, 2001 mining tools with UIMA”, Bioinformatics 25(15), pp 1997-1998, 2009 C Orăsan, D Cristea, R Mitkov, A H Branco, “Anaphora Resolution B J Grosz, A Joshi, S Weinstein, “Centering – a framework for Exercise - An Overview” In Proceedings of the International modelling the local coherence of discourse” In: Computation Conference on Language Resources and Evaluation (LREC), Marrakech, Linguistics, 12(2), June 1995 26 May - 1 June 2008 